Artificial data

First, generate data:

set.seed(0)
datobj = generate_data_generic(p=5, TT=300, fac=.5, nt=2000, dimdat = 3)
ylist = datobj$ylist
X = datobj$X

This produces three-dimensional cytograms ylist and covariates X.

ylist is a list of length \(T=300\), the number of time points (or cytograms). Each element of ylist is an \(n_t \times d\) matrix holding a single cytogram: one row per particle, and \(d=3\) columns, one per measurement dimension. The number of rows \(n_t\) can differ across the elements of ylist.

X is a \(T \times p\) matrix (here \(p=5\)), whose \(t\)’th row contains the relevant (environmental) covariates for the \(t\)’th cytogram.
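The layout can be illustrated with a small base-R sketch (a stand-in for generate_data_generic(); the object names and sizes here are purely illustrative):

```r
set.seed(0)
TT = 5; p = 2; d = 3
nt = sample(50:100, TT)  ## a different particle count n_t at each time point

## One n_t x d matrix (cytogram) per time point, plus a TT x p covariate matrix.
toy_ylist = lapply(nt, function(n) matrix(rnorm(n * d), nrow = n, ncol = d))
toy_X = matrix(rnorm(TT * p), nrow = TT, ncol = p)

stopifnot(length(toy_ylist) == TT,
          all(sapply(toy_ylist, ncol) == d),
          nrow(toy_X) == TT, ncol(toy_X) == p)
```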

plot(ylist[[1]][,1:2], ylab="", xlab="", pch=16, col=rgb(0,0,1,0.2), cex=.5)

Especially if your data form a time series, it can be useful to plot the covariates across time once:

matplot(X, type = 'l')

Now, we estimate the model with fixed regularization parameters \(\lambda_\alpha=0.001\) and \(\lambda_\beta=0.001\), and \(K=4\) clusters.

Internally, flowmix() repeats the estimation nrep times (the default is 5; here we use nrep = 1), and returns the estimated model with the best data fit.
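The restart logic can be sketched as follows (fit_once() is a hypothetical stand-in for one randomly initialized EM run returning a fitted object with a $loglik entry; this is not flowmix's actual code):

```r
## Run nrep random restarts, and keep the one with the highest log-likelihood.
best_of_restarts <- function(nrep, fit_once) {
  fits = lapply(seq_len(nrep), function(irep) fit_once())
  logliks = sapply(fits, function(fit) fit$loglik)
  fits[[which.max(logliks)]]
}
```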

numclust = 4
set.seed(0)
res = flowmix(ylist = ylist, X = X, numclust = numclust,
              mean_lambda = 0.001, prob_lambda = 0.001,
              nrep = 1)
print(res)
## [1] "47 out of 60 regression coefficients for the cluster centers, are zero."
## [1] "16 out of 20 regression coefficients for the cluster probabilities, are zero."
## [1] "The algorithm converged in 19 EM iterations."

The cluster probabilities look like this:

plot_prob(res)

Showing the model across time, in an animation:

par(mfrow = c(1,3), oma = c(2,2,2,2))
ylim = c(-3,8)
xlim = c(-5,8)
for(tt in 1:res$TT){
  for(dims in list(c(1,2), c(2,3), c(3,1))){
    scatterplot_2d(ylist, res, tt, dims = dims, cex_fac=1, ylim=ylim, xlim=xlim)
  }
  mtext(outer = TRUE,
        text = paste0("t=", tt, " out of ", res$TT),
        cex = 2)
}

Cross-validation

Obtain the maximum lambda values

The maximum values of the candidate regularization parameters \(\lambda_\alpha\) and \(\lambda_\beta\), to be used for cross-validation, can be obtained numerically (destin is the directory where the result is saved):

destin = "."  ## Where to save the results
maxres = get_max_lambda(destin,
                        "maxres.Rdata",
                        ylist = ylist,
                        countslist = NULL,
                        X = X,
                        numclust = 4,
                        maxdev = 0.5,
                        max_mean_lambda = 40,
                        max_prob_lambda = 2)

Now setting a few things up.

## Define the location to save the CV results.
destin = "."

## Define the CV folds (contiguous time blocks of size 20, dealt out to
## nfold = 5 folds)
folds = make_cv_folds(ylist, nfold = 5, verbose = FALSE, blocksize = 20)

## Define the candidate lambda values (logarithmically spaced, up to the
## maximum values obtained above).
cv_gridsize = 5
## (If maxres is unavailable, set values manually, e.g. maxres = list(alpha = 1, beta = 1).)
prob_lambdas =  logspace(min = 0.0001, max = maxres$alpha, length = cv_gridsize)
mean_lambdas = logspace(min = 0.0001, max = maxres$beta, length = cv_gridsize)
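For reference, here are base-R sketches of what the two helpers above are doing; make_blocked_folds and logspace_sketch are illustrative names, not the package's actual implementations:

```r
## Blocked CV folds: contiguous blocks of `blocksize` time points are dealt
## out cyclically to nfold folds, so test sets respect the time ordering.
make_blocked_folds <- function(TT, nfold, blocksize) {
  block_id = ceiling(seq_len(TT) / blocksize)      ## which block each t is in
  fold_of_block = rep_len(seq_len(nfold), max(block_id))
  split(seq_len(TT), fold_of_block[block_id])      ## fold -> time indices
}

## Logarithmically spaced grid between min and max:
logspace_sketch <- function(min, max, length) {
  10^seq(log10(min), log10(max), length.out = length)
}
logspace_sketch(1e-4, 1, 5)  ## 1e-04 1e-03 1e-02 1e-01 1e+00
```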

One EM algorithm = one job

Next, what we call “one job” (the function one_job(ialpha, ibeta, ifold, irep)) runs the EM algorithm once, for:

  • the ialpha’th \(\lambda_\alpha\) value (out of prob_lambdas),
  • the ibeta’th \(\lambda_\beta\) value (out of mean_lambdas),
  • the ifold’th test fold (out of the nfold=5 CV folds), and
  • the irep’th random restart of the EM algorithm (out of nrep in total).

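Taken together, the cross-validation is a grid of independent jobs, one per (ialpha, ibeta, ifold, irep) combination; schematically (pseudocode, with the remaining one_job() arguments elided):

```r
## Every combination is one independent job; with cv_gridsize = 5 and
## nfold = 5, that is 5 * 5 * 5 * nrep jobs in total.
for (ialpha in 1:cv_gridsize)
  for (ibeta in 1:cv_gridsize)
    for (ifold in 1:nfold)
      for (irep in 1:nrep)
        one_job(ialpha, ibeta, ifold, irep, ...)  ## remaining arguments as in the example below
```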
## Example of one CV job for one pair of regularization parameters (and CV folds
## and EM replicates)
ialpha = 1
ibeta = 1
ifold = 1
irep = 1
destin = "~/Desktop"  ## Change to your target destination.
one_job(ialpha = ialpha,
        ibeta = ibeta,
        ifold = ifold,
        irep = irep,
        folds = folds,
        destin = destin,
        mean_lambdas = mean_lambdas, prob_lambdas = prob_lambdas,
        ## The rest that is needed explicitly for flowmix()
        ylist = ylist,
        countslist = NULL,
        X = X,
        numclust = 4,
        maxdev = 0.5
        ## verbose = TRUE
        )

Also, the nrep estimated models for any given ialpha and ibeta, refit on the full data, are obtained using one_job_refit():

## Example of one replicate of model estimation (in the full data) for one pair
## of regularization parameters.
ialpha = 1
ibeta = 1
irep = 1
destin = "~/Desktop"  ## Change to your target destination.
one_job_refit(ialpha = ialpha,
              ibeta = ibeta,
              irep = irep,
              destin = destin,
              mean_lambdas = mean_lambdas, prob_lambdas = prob_lambdas,
              ## The rest that is needed explicitly for flowmix()
              ylist = ylist,
              countslist = NULL,
              X = X,
              numclust = 4,
              maxdev = 0.5)

(Since all of these jobs are independent, the computation is embarrassingly parallel; for the full cross-validation, it’s recommended to use multiple cores, computers, or servers.)
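On a single machine, one way to dispatch the jobs in parallel is base R's parallel package. This is only a sketch: it assumes the objects defined above, and the one_job() arguments used in the earlier example:

```r
library(parallel)

## Enumerate every (ialpha, ibeta, ifold, irep) combination as one row.
jobs = expand.grid(ialpha = 1:cv_gridsize, ibeta = 1:cv_gridsize,
                   ifold = 1:5, irep = 1:5)

## Run the jobs across 8 cores; each call saves its result under destin.
mclapply(seq_len(nrow(jobs)), function(ii) {
  one_job(ialpha = jobs$ialpha[ii],
          ibeta = jobs$ibeta[ii],
          ifold = jobs$ifold[ii],
          irep = jobs$irep[ii],
          folds = folds, destin = destin,
          mean_lambdas = mean_lambdas, prob_lambdas = prob_lambdas,
          ylist = ylist, countslist = NULL, X = X,
          numclust = 4, maxdev = 0.5)
}, mc.cores = 8)
```

On a compute cluster, you would instead submit each one_job() call as its own batch job.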

A single function

cv.flowmix conducts cross-validation by wrapping most of the above into a single function.

(This code takes a long time to run, so it’s recommended that you run it separately in a script; use the mc.cores argument to run the jobs on multiple cores):

cvres = cv.flowmix(ylist = ylist,
                   countslist = NULL,
                   X = X,
                   maxdev = 0.5,
                   numclust = 4,
                   prob_lambdas = prob_lambdas,
                   mean_lambdas = mean_lambdas,
                   nrep = 5,
                   nfold = 5,
                   destin = "~/Desktop",
                   mc.cores = 8)

Then, the results are saved into separate files whose names follow these rules:

  • “1-1-1-1.Rdata” (i.e., ialpha-ibeta-irep-ifold.Rdata) for each CV job.
  • “1-1-1-cvres.Rdata” (i.e., ialpha-ibeta-irep-cvres.Rdata) for each model estimated on the full data.

Then, the results are summarized, and optionally saved to a file summary.RDS, if save=TRUE:

cvres = cv_summary(destin = ".",
                   cv_gridsize = 5,
                   nrep = 5,
                   nfold = 5,
                   save = TRUE,
                   filename = "summary.RDS")

The model chosen by cross-validation is this:

cvres$bestres %>% print()

Binning data

If the data contain too many particles, we can reduce the size of ylist by working with binned counts instead.

The resulting counts object, countslist, is passed to flowmix() as an additional input.

Here is an example. make_grid(ylist, gridsize=30) makes an equally spaced grid of 30 bins spanning the data range in each dimension. Then, bin_many_cytograms() assigns the particles in ylist to these bins. The resulting object is a list containing the bin (grid) centers in ybin_list and the per-bin counts in counts_list.

(Not used here, but you can optionally weight each particle, e.g. by its biomass.)
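To make the idea concrete, here is the binning scheme in one dimension, using base R only (bin_many_cytograms() does the analogous thing in \(d=3\) dimensions):

```r
y = rnorm(1e5)                                  ## one dimension of one cytogram
breaks = seq(min(y), max(y), length.out = 31)   ## 30 equally sized bins
mids = (breaks[-1] + breaks[-31]) / 2           ## bin centers
counts = as.vector(table(cut(y, breaks, include.lowest = TRUE)))
## The pair (mids, counts) now summarizes 1e5 particles with 30 numbers.
stopifnot(sum(counts) == length(y))
```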

## Bin this data
grid = make_grid(ylist, gridsize = 30, grid.ind=FALSE)
obj = bin_many_cytograms(ylist, grid, mc.cores = 8, verbose=FALSE)  
ylist = obj$ybin_list
countslist = obj$counts_list

## Run the algorithm on binned data
res = flowmix(ylist = ylist,
              X = X,
              countslist = countslist,
              numclust = numclust,
              mean_lambda = 0.001,
              prob_lambda = 0.001,
              verbose = FALSE,
              maxdev = 0.5)

Real data

You can repeat the above code blocks with real data.

## Load data
load(file = "~/repos/flowmix/demo-MGL1704.Rdata")
X = X %>% select(-time, -lat, -lon) %>% as.matrix()
ylist = ybin_list
countslist = biomass_list

## Estimate model
set.seed(1)
res = flowmix(ylist, X, numclust = 10,
              countslist = countslist,
              mean_lambda = 0.001,
              prob_lambda = 0.001,
              maxdev = 0.5,
              nrep = 1,
              verbose = FALSE)

Now visualizing the results as before.

## Default print
print(res)
## [1] "445 out of 1110 regression coefficients for the cluster centers, are zero."
## [1] "274 out of 370 regression coefficients for the cluster probabilities, are zero."
## [1] "The algorithm converged in 44 EM iterations."
## Plot estimated probabilities
plot_prob(res)
## Three scatterplots at each of the first 50 time points
par(mfrow = c(1,3), oma = c(2,2,2,2))
ylim = c(-3,8)
xlim = c(-5,8)
dimnames = c("diam", "red", "orange")
for(tt in 1:50){
  for(dims in list(c(1,2), c(2,3), c(3,1))){
    scatterplot_2d(ylist = ylist,
                   countslist = countslist,
                   obj = res,
                   tt,
                   dims = dims, cex_fac=8,
                   pt_col = rgb(0 ,0, 1, 0.1),
                   xlab = dimnames[dims[1]],
                   ylab = dimnames[dims[2]])
  }
  mtext(outer = TRUE,
        text = paste0("t=", tt, " out of ", res$TT),
        cex = 2)
}